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Before entering into any more details about the methodological aspects, let's discuss the 
motivations behind the association of the two phrases "online (estimation)" and "Expectation- 
Maximisation (algorithm)" as well as their pertinence in the context of mixtures and more general 
models involving latent variables. 

The adjective online refers to the idea of computing estimates of model parameters on-the-fly, 
without storing the data and by continuously updating the estimates as more observations become 
available. In the machine learning literature, the phrase online learning has been mostly used re- 
cently to refer to a sp ecific way of analysing the perfo rmance of algorithms that incorporate obser- 



le p_€ 

vations progressively (jCesa-Bianchi and Lugosl 120061 ) . We dot not refer here to this approach and 



will only consider the more traditional setup in which the objective is to estimate fixed parameters 
of a statistical model and the performance is quantified by the proximity between the estimates 
and the parameter to be estimated. In signal processing and contr ol, the sort of algorithms con- 
sidered in the fol l owing is often referred to as adaptive or recursive (jLjung and Soderstroml . 1198.1 



Benveniste et al.1 . Il990l ). The word recursive is so ubiquitous in computer science that its use 



may be somewhat ambiguous and is not recommended. The term adaptive may refer to the type 
of algorithms considered in this chapter but is also often used in contexts where the focus is on 
the ability to track slow drifts or abrupt changes in the model parameters, which will not be our 
primary concerns. 



*To appear in Mixtures, edited by Kerrie Mengersen, Mike Titterington and Christian P. Robert. 
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Traditional applications of online algorithms involve situations in which the data cannot be 
stored, due to its volume and rate of sampling as in real-time signal processing or stream mining. 
The wide availability of very large datasets involving thousands or millions of examples is also 
at the origin of the current renewed interest in online algorithms. In this context, online algo- 
rithms are often more efficient — i.e., converging faster towards the target parameter value — and 
need fewer computer re source, in terms of memory or disk access, than their batch counterparts 
dNeal and Hintonl . Il99sh . In this chapter, we are interested in both contexts: when the online al- 
gorithm is used to process on-the-fly a potentially unlimited amount of data or when it is applied 
to a fixed but large dataset. We will refer to the latter context as the batch estimation mode. 

Our main interest is maximum likelihood estimation and although we may consider adding a 
penalty term (i.e., Maximum A Posteriori estimation), we will not consider "fully Bayesian" meth- 
ods which aim at sequentially simulating from the parameter posterior. The main motivation for 
this restriction is to stick to computationally simple iterations which is an essential requirement of 
successful online methods. In particular, when online algorithms are used for batch estimation, it is 
required that each parameter update can be carried out very efficiently for the method to be com- 
putationally competitive with traditional batch estimation algorithms. Fully Bayesian approaches 
— see, e.g., ( Chopin . 20021 ) — typically require Monte Carlo simulations even in simple models an d 
raise some challenging stability issues when used on very long data records ( Kantas et al. . 20091 ). 



This quest for simplicity of each of the parameter update is also the reason for focussing on 
the E M (Expectation-Maximisation) algorithm. Ever since its introduction by Dempster et al.l 
the EM algorithm has been criticised for its of ten sub - optim a l convergence behavi o ur an d 
many variants have been proposed by, among others, Lange (jl995l ). Meng and Van Dyk ( 19971 ). 
This being said, thirty years after the seminal paper by Dempster and his coauthors, the EM 
algorithm still is, by far, the most widely used inference tool for latent variable models due to its 
numerical stability and ease of implementation. Our main point here is not to argue that the EM 
algorithm is always preferable to other options. But the EM algorithm which does not rely on 
fine numerical tunings involving, for instance, line-searches, re-projections or pre-conditioning is 
a perfect candidate for developing online versions with very simple updates. We hope to convince 
the reader in the rest of this chapter that the online version of EM that is describ ed here shares 
many of the attractive properties of the original proposal of iDempster et al.1 (119771 ) and provides 
an easy to implement and robust solution for online estimation in latent variable models. 

Quite obviously, guaranteeing the strict likelihood ascent property of the original EM algorithm 
is hardly feasible in an online context. On the other hand, a remarkable property of the online 
EM algorithm is that it can reach asymptotic Fisher efficiency by converging towards the actual 
parameter value at a rate which is equivalent to that of the Maximum Likelihood Estimator 
(MLE). Hence, when the number of observations is sufficiently large, the online EM algorithm 
does becomes highly competitive and it not necessary to consider potentially faster-converging 
alternatives. When used for batch estimation, i.e., on a fixed large dataset, the situation is more 
contrasted but we will nonetheless show that the online algorithm does converge towards the 
maximum likelihood parameter estimate corresponding to the whole data. To achieve this result, 
one typically needs to scan the data set repeatedly. In terms of computing time, the advantages 
of using an online algorithm in this situation will typically depend on the size of the problem. For 
long data records however, this approach is certainly more recommendable than the use of the 
traditional batch EM algorithm and preferable to other alternatives considered in the literature. 

The rest of this chapter is organised as follows. The first section is devoted to the modelling 
assumptions that are adopted throughout the chapter. In Section El we consider the large-sample 
behaviour of the traditional EM algorithm, insisting on the concept of the limiting EM recursion 
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which is instrumental in the design of the online algorithm. The various aspects of the online EM 
algorithm are then examined in Sections and HI 

Although the chapter is intended to be self-contained, we will nonetheless assume that the 
reader is familiar with the fundamental concepts of classical statistical inference and, in particular, 
with Fisher in formation, exponential famili e s of distributions and maxim um likelihood estimation, 
at the level of Lehmann and Cas ella (ll998h . lBickel and Doksin"r] l|200d ) or equivalent texts. 



1 Model and Assumptions 

We assume that we are able to observe an independent sequence of identically distributed data 
(Yt)t>i, with marginal distribution ir. An important remark is that we do not necessarily assume 
that 7r corresponds to a distribution that is reachable by the statistical model that is used to fit 
the data. As discussed in Section [3.41 below, this distinction is important to analyse the use of the 
online algorithm for batch maximum-likelihood estimation. 

The statistical models that we consider are of the missing-data type, with an unobservable 
random variable Xt associated to each observation Yf. The latent variable Xt may be continuous 
or vector-valued and we will not be restricting o urselves to finite mixture models. Following the 



terminology introduced by lDempster et al.1 (119771 ) . we will refer to the pair (Xt, Y t ) as the complete 



data. The likelihood function is thus defined as the marginal 

fe(yt) = / pe(x t ,yt)dx t , 



where 6 E is the parameter of interest to be estimated. If the actual marginal distribution of 
the data belongs to the family of the model distributions, i.e., if it = fg^ for some parameter value 
the model is said to be well- specified; but, as mentioned above, we dot not restrict ourselves 
to this case. In the following, the statistical model (pe)8e& is assumed to verify the following key 
requirements. 

Assumption 1. Modelling Assumptions 

(i) The model belongs to a (curved) exponential family 

Pe(x t ,yt) = h(x t ,y t )exp((s(xt,yt),ip(0)) - A(0)) , (1) 

where s(xt,yt) is a vector of complete-data sufficient statistics belonging to a convex set S, 
(-, •) denotes the dot product and A is the log-partition function. 

(ii) The complete-data maximum-likelihood is explicit, in the sense that the function 9 defined 
by 

e-.s ^ e 

S h-> 9(S) = arg max (S, ib(6)) - A(6) 
8e& 

is available in closed-form. 
Assumption [1] defines t he context where the EM algorithm may be used directly (see, in 



particular, the discussion of lDempster et all Il977l ). Note that ([T]) is not restricted to the specific 



case of exponential family distributions in canonical (or natural) parameterisation. The latter 
correspond to the situation where ip is the identity function, which is particular in that pg is then 
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log-concave with the complete-data Fisher information matrix I p (0) being given by the Hessian 
VqA(9) of the log-partition function. Of course, if ip is an invertible function, one could use 
the reparameterisation n = ip{9) to recover the natural parameterisation but it is important to 
recognise that for many models of interest, ip is a function that maps low-dimensional parameters 
to higher-dimensional statistics. To illustrate this situation, we will use the following simple 
running example. 

Example 2 (Pr obabilistic PC A Mod e l). Co nsider the probabilistic Principal Component Analysis 
(PCA) model of Tipping and Bishoj, (199 A ). The model postulates that a d-dimensional observa- 
tion vector Y-t can be represented as 



Y t = uX t + X^Nt 



(2) 



where Nt is a centred unit-covariance d-dimensional multivariate Gaussian vector, while the latent 
variable Xt also is such a vector but of much lower-dimension. Hence, u is a d x r matrix with 
r «d. 

Eq. (J2J) is thus fully equivalent to assuming that Yt is a centred d-dimensional Gaussian vari- 
able with a structured covariance matrix given by £(#) = uv! + XI d, where the prime denotes 
transposition and Id is the d-dimensional identity matrix. Clearly, there are in this model many 
ways to estimate u and X that do n ot rely on the proba b ilistic model of ([2"])/ the standard PCA 
being probably the most well-known. \ Tipping and Bishoj. Il99<\ ) discuss several reasons for using 
the probabilistic approach that include the use of priors on the parameters, the ability to deal with 
missing or censored coordinates of the observations but also the access to quantitative diagnos- 
tics based on the likelihood that are helpful, in particular, for determining the number of relevant 
factors. 

To cast the model of ([2]) in the form given by ([T]) ; the complete- data model 

lu : «< W 




V u : uu' + Xld J 



must be reparameterised by the precision matrix £ 1 (6), yielding 



Pe(x t ,y t ) 



(27T) 



-d/2 



exp 



trace {^ 1 (9)s(x t ,y t )} - -log |E(0)| 




where s(xt,yt) is the rank one matrix 

s(x t ,yt) 



Hence in the probabilistic PCA model, the ip function in this case maps the pair (u, X) to the (d+1)- 
dimensional symmetric positive definite matrix Yet, Assumpt ion^ (ii) holds in this cas e 

and the EM algorithm can be used — see the general formulas given in (Tipping and Bishopl . 1999]) 
as well as the details of the particular case where r = 1 below. 

In the following, we will more specifically look at the particular case where r = 1 and u is 
thus a d-dimensional vector. This very simple case also provides a nice and concise i llustration of 



more complex situations, such as the Direction Of Arrival (DOA) model considered b mCappe et al. 
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(200 A ). The case of of a single factor PC A is also interesting as most required calculations can 
be done explicitly. In particular, it is possible to provide the following expression for using 
the block matrix inversion and Sherman-Morrison formulas: 



^1 + |M| 2 /A \ -u'/\^ 
V -u/A : \- l I d ) 



The above expression shows that one may redefine the sufficient statistics s(xt,yt) as consisting 



so 



lely of the scalars s^°\yt) = \\yt\\ 2 , s^(xt) = x t an d °f ^ e d-dimensional vector s lyl \xt,yt) 



xtyt- The complete-data log-likelihood is then given by 



logp e (x t ,y t ) = C 



.•it 



nd log X + 



iW(yt)-2u' S W(x t ,y t ) + sW(x t )\\u\ 
X 



(3) 



ignoring constant terms that do not depend on the parameters u and X. 

Note that developing an online algorithm for ([2]) in this particular case is equivalent to re- 
cursively estimating the largest eigenvalue of a covariance matrix from a series of multivariate 
Gaussian observations. We return to this example shortly below. 



2 The EM Algorithm and the Limiting EM Recursion 

In this section, we first review core elements regarding the EM algorithm that will be needed in the 
following. Next, we introduce the key concept of the limiting EM recursion which corresponds to 
the limiting deterministic iterative algorithm obtained when the number of observations grows to 
infinity. This limiting EM recursion is important both to understand the behaviour of classic batch 
EM when used with many observations and for motivating the form of the online EM algorithm. 



2.1 The Batch EM Algorithm 



In light of Assu mption [J and in particul ar of the assumed form of the likelihood in (pQ), the classic 
EM algorithm of lDempster et al.l (|1977l ) takes the following form. 

Algorithm 3 (Batch EM Algorithm). Given n observations, Y±, . . . ,Y n and an initial parameter 
guess 8q, do, for k > 1, 



E-Step 



M-Step 



1 - 

Sn,k = -'E,V 0k _ 1 [8{X t ,Y t )\Y t ] , 



t=l 



Ok = 8 {S; 



n,k ) 



(4) 



(5) 



Returning to the single factor PC A model of Example [21 it is easy to check from the expression 
of the complete-data log-likelihood in ([3]) and the definition of the sufficient statistics that the E- 
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step reduces to the computation of 



s i0 \Y t ) 



\Y\ 



E(9 



Y, 



s {2) (X t 



Y 



YtY(u 
\+\\u\\ 2 ' 
A (1» 2 



X + \\u\ 



(a + \\ u \\ 2 y 



(6) 



with the corresponding empirical averages 

si 



s 



.(0) _. 


1 


n,k 




n 


,(1) _. 


1 


n.k 




n 


-(2) 


1 


n,k 


n 



t=l 



t=l 

n 



Y 



Y 



t=i 



The M-step equations which define the function 9 are given by 



e (u) (s n , k ) = s^ k /s%, 



I f o(0) _ I, c'(l) |,2 / o(2) \ 
^ \°n,k \\°n,k\\ I °n,k J ■ 



(7) 



2.2 The Limiting EM Recursion 



Returning to the general case, a very important remark is that Algorithm [3] can be fully reparam- 
eterised in the domain of sufficient statistics, reducing to the recursion 



1 n 



t=l 



with the convention that S n fi is such that #o = 6(S n o)- Clearly, if an uniform (wrt. S £ S) law 
of large numbers holds for the empirical averages of 'E$rg\ [s(X t ,Y t )\ Y t ], the EM update tends, as 
the number of observations n tends to infinity, to the deterministic mapping T$ on S defined by 



T S (S) = E 7T (E ffw [ 8 (X 1 ,Yi)|y 1 ]) 



(8) 



Hence, the sequence of EM iterates {S n ^) k >i converges to the sequence (Tg(So))k>i, which is 
deterministic except for the choice of Sq. We refer to the limiting mapping Tg defined in ([8]), as 
the limiting EM recursion. Of course, this mapping on S also induces a mapping Tq on by 
considering the values of 6 associated to values of S by the function 9. This second mapping is 
defined as 

T e (9) = 9{E 7T (E e [s(X 1 ,Y 1 )\Y 1 })} . (9) 



Using exactly the same arguments as those of lDempster et al.1 (119771 ) for the classic EM algorithm, 
it is straightforward to check that under suitable regularity assumptions, Tq is such that 
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1. The Kullback-Leibler divergence D(ir\fo) = J log J fl ^ 7r(x)dx is a Lyapunov function for the 
mapping T@, that is, 

D{*\f Te w) <D(*\f e ) ■ 

2. The set of fixed points of Te, i.e., such that Tq(6) = 9, is given by 

{9: VeD(ir\fg) = 0}, 

where Vg denotes the gradient. 

Obviously, ([9]) is not directly exploitable in a statistical context as it involves integrating under 
the unknown distribution of the observations. Th is limiting EM recu rsion can however be used 



in the context of adaptive Monte Carlo methods (ICappe et all I2008T) and i s kno wn in machine 



learning as part of the information bottleneck framework ( Slonim and Weiss . 20031 ). 



2.3 Limitations of Batch EM for Long Data Records 

But the main interest of ([9]) is to provide a clear understanding of the behaviou r of the EM algo- 



rithm when used with very long data records, justifying much of the intuition of lNeal and Hinton 



(1999). Note that a large part of the post 1980's literature on the EM algorithm focusses on 



accelerating convergence towards the MLE for a fixed data size. Here, we consider the related 
but very different issue of understanding the behaviour of the EM algorithm when the data size 
increases. 

Figure [T] displays the results obtained with the batch EM algorithm for the single component 
PCA model estimated from data simulated under the model with u being a 20-dimensional vector 
of unit norm and A = 5. Both u and A are treated as unknown parameters but only the estimated 
squared norm of u is displayed on Figure HJ as a function of the number of EM iterations and 
for two different data sizes. Note that in this very simple model, the observation likelihood being 
given the AT(0, uu' + Aid) density, it is straightforward to check that the Fisher information for 
the parameter \\u\\ 2 equals {2(A + ||it|| ) 2 }~ , which has been used to represent the asymptotic 
confidence interval in grey. The width of this interval is meant to be directly comparable to that 
of the whiskers in the box-and-whisker plots. Boxplots are used to summarise the results obtained 
from one thousand independent runs of the method. 

Comparing the top and bottom plots clearly shows that when the number of iterations in- 
creases, the trajectories of the EM algorithm converge to a fixed deterministic trajectory which is 
that of the limiting EM recursion (Te(0o))fc>i- Of course, this trajectory depends on the choice 
of the initial point 6q which was fixed throughout this experiment. It is also observed that the 
monotone increase in the likelihood guaranteed by the EM algorithm does not necessarily imply a 
monotone convergence of all parameters towards the MLE. Hence, if the number of EM iterations 
is kept fixed, the estimates returned by the batch EM algorithm with the larger number of obser- 
vations (bottom plot) are not significantly improved despite the ten-fold increase in computation 
time (the E step computations must be done for all observations and hence the complexity of the 
E-step scales proportionally to the number n of observations). From a statistical perspective, the 
situation is even less satisfying as estimation errors that were not statistically significant for 2,000 
observations can become significant when the number of observations increases. In the upper plot, 
interrupting the EM algorithm after 3 or 4 iterations does produce estimates that are statistically 
acceptable (comparable to the exact MLE) but in the lower plot, about 20 iterations are needed to 
achieve the desired precision. As sugg ested bv lNeal and Hintonl (|l999h . this paradoxical situation 



can to some extent be avoided by updating the parameter (that is, applying the M-step) more 
often, without waiting for a complete scan of the data record. 
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Batch EM iterations 



Figure 1: Convergence of batch EM estimates of || U 1 1 ELS cl function of the number of EM iterations 
for 2,000 (top) and 20,000 (bottom) observations. The box-and-whisker plots (outliers plotting 
suppressed) are computed from 1,000 independent replications of the simulated data. The grey 
region corresponds to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic 
Gaussian approximation of the MLE. 



3 Online Expectation-Maximisation 



3.1 The Algorithm 

The limiting-EM argument developed in the following section shows that when the number of 
observations tends to infinity, the EM algorithm is trying to locate the fixed points of the mapping 
Tg defined in ([8]), that is, the roots of the equation 



eJe- 9{s) [ s (x 1 ,y 1 )\y 1 ])-s = o. 



(10) 



Although we cannot compute the required expectation wrt. the unknown distribution ir of the 
data, each new observation provides us with an unbiased noisy observation of this quantity through 
Efl/m [s(X n ,Y n )\ Y n ]. Solving (flOl) is thus an instance of the most basic case where the Stochastic 
Approximation (or Robbins-Monro) method can be u sed. The literatu r e on S tochastic Approx- 
i matio n is huge but we recommend the textbooks by iBenveniste et al. I dl99Ch. iKushner and Yinl 
:'or more details and examples as well as the review paper by lLail (J2003) for a historical 
perspective. The standard stochastic approximation approach to approximate the solution of f|10() 
is to compute 



S n = S n -i + 7 n (E 



J e(s n 



l) [ s (x n ,y n )|y ri ]-s , ri _i) 



(ii) 



for n > 1, So being arbitrary and (7 n )n>i denoting a sequence of positive stepsizes that decrease 
to zero. This equation, rewritten in an equivalent form below, is the main ingredient of the online 
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EM algorithm. 



Algorithm 4 (Online EM Algorithm). Given So, do an d a sequence of step sizes {^ n )n>i, do, for 
n > 1, 



Stochastic E-Step 
M-Step 



S n = (1 - ln)S n -l + 7 nE 0n _ 1 [s(X n , Y n )\ Y n ] , (12) 



On = e(S n ) . (13) 



Rewriting (|1Q|) under the form displayed in (|12p is very enlightening as it shows that the new 
statistic S n is obtained as a convex combination of the previous statistic S n -i and of an update 
that depends on the new observation Y n . In particular it shows that the stepsizes 7 n have a natural 
scale as their highest admissible value is equal to one. This means that one can safely take 71 = 1 
and that only the rate at which the stepsize decreases needs to be selected carefully (see below). 
It is also observed that if 71 is set to one, the initial value of So is never used and it suffices to 
select the initial parameter guess 60', this is the approach used in the following simulations. 

The only adjustment to Algorithm 0] that is necessary in practice is to omit the M-step of (I13p 
for the first observations. It typically takes a few observations before the complete-data maximum 
likelihood solutio n is well-defined and the par ameter update should be inhibited during this early 



phase of training (jCappe and Moulinesl . 120091 ) . In the simulations presented below in Sections 



and 13.41 the M-step was omitted for the first five observations only but in most complex scenarios 
a longer initial parameter freezing phase may be necessary. 

When 71 < 1, it is tempting to interpret the value of So as being associated to a prior on the 
parameters. Indeed, the choice of a conjugate prior for 9 in the exponential family defined by (pQ) 
does result in a complete-data Maximum A Posteriori (MAP) estimate of 9 given b y 9(S n] k + So/n ) 



(instead of #(<SVt,fc) for the MLE), where So is the hyper-parameter of the prior (jRobertl . l200ll ). 
However, it is easily checked that the influence of So i n S n decreases as njc=i(l ~~ Ik)-, which for 
the suitable stepsize decrease schemes (see beginning of Section [3721 below) . decreases faster than 
1/n. Hence, the value of So has a rather limited impact on the convergence of the online EM 
algorithm. To achieve MAP estimation (assuming a conjugate prior on 9) it is thus recommend 
to replace (fT3|) by 8 n = 9{S n + So/n), where So is the hyperparameter of the prior on 9. 

A last remark of importance is that Algorithm |4] can most naturally be interpreted as a 
stochastic approximation recursion on the sufficient statistics rather than on the parameters. There 
does not exists a similar algorithm that operates directly on the parameters because unbiased 
approximations of the rhs. of based on the observations Yi are not easily available. As we will 
see below, Algorithm U] is asymptotically equivalent to a gradient recursion on 9 n that involves an 
additional matrix weighting which is not necessary in (I12D . 

3.2 Convergence Properties 

Under the assumption that the stepsize seque nce satisfies y~l n 7 n = 00, Y^, n 7^ < 00 and other reg- 
ularity hypotheses that are omitted here (see Cappe and Moulinesl . 2009 for details) the following 



properties characterise the asymptotic behaviour of the online EM algorithm, 
(i) The estimates 9 n converge to the set of roots of the equation VQD(n\fg) = 0. 



9 



(ii) The algorithm is asymptotically equivalent to a gradient algorithm 



9 n = n _i + 7nJ-\0 n ^)Ve log f 9n ^(Y n ) , (14) 

where J (9) = -E* (E„ [V 2 e log p 9 {X x 

(iii) For a well specified model (i.e., if n = fg^) and under Polyak-Ruppert averaging, 8 n is 
Fisher efficient: sequences that do converge to 9* are such that \fn{9 n — 6+) converges in 
distribution to a centred multivariate Gaussian variable with covariance matrix I n (8+), where 
I w (6) = — E^fVa log f$(Yi)] is the Fisher information matrix corresponding to the observed 
data. 

Polyak-Ruppert averaging refers to a postprocessing step which simply consists in replacing 
the estimated parameter values 9 n produced by the algorithm by their average 



E 



n — rin 

u t=n +l 



wher e no is a positive index at which averaging is started (jPolvak and Juditskvl . Il992l . iRuppert , 



19881 ). Regarding the statements (i) and (ii) above, it is important to understand that the limiting 



estimating equation VgD(n\fg) = may have multiple solutions, even in well-specified models. In 
practice, the most important factor that influences the convergence to one of the stationary points 
of the Kullback-Leibler divergence D(ir\fg) rather than the other is the choice of the initial value 9q. 
An additional important remark about (i)-(iii) is the fact that the asymptotic equivalent gradient 
algorithm in (|14p is not a practical algorithm as the matrix J{9) depends on ir and hence cannot 
be computed. Note also that J{8) is (in general) neither equal to the complete-data information 
matrix I p (9) nor to the actual Fisher information in the observed model I n (9). The form of J (9) 
as well as its ro le to approximate the convergence behaviour of the EM algorithm follows the idea 
of lLangj (| 19951 ). 



From our experience, it is generally sufficient to consider stepsize sequences of the form 7 n = 
l/n a where the useful range of values for a is from 0.6 to 0.9. The most robust setting is obtained 
when taking a close to 0.5 and using Polyak-Ruppert averaging. The latter however requires 
to chose an index no that is sufficiently large and, hence, some idea of the convergence time is 
necessary. To illustrate these observations, Figure [2] displays the results of online EM estimation 
for the single component PCA model, exactly in the same conditions as those considered previously 
for batch EM estimation in Section 12.31 

From a computational point of view, the main difference between the online EM algorithm and 
the batch EM algorithm of Section 12.11 is that the online algorithm performs the M-step update 
in ([7]) after each observation, according to (|13p . while the batch EM algorithm only applies the 
M-step update after a complete scan of all available observations. Both algorithms however require 
the computation of the E-step statistics following ([6]) for each observation. In batch EM, these 
local E-step computation are accumulated, following @, while the online algorithm recursively 
averages these according to (I12p . Hence, as the computational complexity of the E and M steps 
are, in this case, comparable, the computational complexity of the online estimation is equivalent 
to that of one or two batch EM iterations. With this in mind, it is obvious that the results of 
Figure [2] compare very favourably to those of Figure [1] with an estimation performance that is 
compatible with the statistical uncertainty for observation lengths of 2,000 and larger (last two 
boxplots on the right in each display). 
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Figure 2: Online EM estimates of ||u|| 2 for various data sizes (200, 2,000 and 20,000 observations, 
from left to right) and algorithm settings (a = 0.9, a = 0.6 and a = 0.6 with Polyak-Ruppert 
averaging, from top to bottom). The box-and- whisker plots (outliers plotting suppressed) are 
computed from 1,000 independent replications of the simulated data. The grey regions corresponds 
to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation 
of the MLE. 
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Regarding the choice of the stepsize decrease exponent a, it is observed that while the choice 
of a = 0.6 (middle plot) does result in more variability than the choice of a = 0.9 (top plot), 
especially for smaller observation sizes, the long-run performance is somewhat better with the 
former. In particular, when Polyak-Ruppert averaging is used (bottom plot), the performance for 
the longest data size (20,000 observations) is clearly compatible with the claim that online EM 
is Fisher-efficient in this case. A practical concern associated with averaging is the choice of the 
initial instant no where averaging starts. In the case of Figure [21 we choose no to be equal to 
half the length of each data record, hence averaging is used only on the second half of the data. 
While it produces refined estimates for the longer data sizes, one can observe that the performance 
is rather degraded for the smallest observation size (200 observations) due to the fact that the 
algorithm is still very far from having converged after just 100 observations. Hence, averaging is 
efficient but does require to chose a value of no that is sufficiently large so as to avoid introducing 
a bias due to the lack of convergence. 

In our experience, the fact that choices of a close to 0.5 are more reliable than values closer 
to the upper limit of 1 is a very constant observation. It may come to some surprise for readers 
familiar with the gradient descent algorithm as used in numerical optimisation, which shares some 
similarity with (|12j) . In this case, it is known t hat the optimal stepsize choice is of the form 
ln = a(n + brHor a broad class of functions fe^, »). However, the situation here 



is very different as we do not observe exact gradient or expectations but only noisy version of 
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Figure 3: Four superimposed trajectories of the estimate of u\ (first component of u) for various 
algorithm settings (a = 0.9, a = 0.6 and a = 0.6 with Polyak-Ruppert averaging, from top to 
bottom). The actual value of u\ is equal to zero. 



themE). Figure [3] shows that while the trajectory of the parameter estimates appears to be much 
rougher and variable with a = 0.6 than with a = 0.9, the bias caused by the initialisation is also 
forgotten much more rapidly. It is also observed that the use of averaging (bottom display) makes 
it possible to achieve the best of both worlds (rapid forgetting of the initial condition and smooth 
tra jectories). 

Liang and Klein (l2009h considered the performance of the online EM algorithm for large-scale 



natural language processing applications. This domain of application is characterised by the use 
of very large-dimensional models, most often related to the multinomial distribution, involving 
tens of thousands of different words and tens to hundreds of semantic tags. As a consequence, 
each observation, be it a sentence or a whole document, is poorly informative about the model 
parameters (typically a given text contains o n ly a very limited portion of the whole available 
vocabulary). In this context, Liang and Klein ( 2009I ) found that the algorithm was highly com- 
petitive with other approaches but only when combined with mini-batch blocking: rather than 
applying Algorithm |4] at the observation scale, the algorithm is used on mini-batches consisting of 
m consecutive observation s (y m (k-i)+i^mk+2 ■ ■ ■ ,Y m k)k>l- For the models and data considered 
by Liang and Klein ( 2009I ). values of m of up to a few thousands yielded optimal performance. 
More generally, mini-batch blocking can be useful in dealing with mixture-like models with rarely 
active components. 



1 For a complete-data exponential family model in natural parameterisation, it is easily checked that 
Eg [s(X n , Y n )\ Y n ] = Eg [s (X n , Y n )] + Ve log fg(Y n ) and hence that the recursion in the space of sufficient statis- 
tics is indeed very close to a gradient ascent algorithm. However, we only have access to Vg log fo{Y n ) which is a 
noisy version of the gradient of the actual limiting objective function D(ir\fe) that is minimised. 
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3.3 Application to Finite Mixtures 

Although we considered so far only the simple case of Example [2] which allows for the compu- 
tation of the Fisher information matrix and hence for quantitative assessment of the asymptotic 
performance, the online EM algorithm is easy to implement in models involving finite mixture of 
distributions. 

Figure 0] displays the Bayesian network representation corresponding to a mixture model: for 
each observation Yt there is an unobservable mixture indicator or allocation variable Xt that 
takes its value in the set {l,...,m}. A typical parameterisation for this model is to have a 
separate sets of parameters u> and f3 for, respectively, the parameters of the prior on Xt and of the 
conditional distribution of Yt given Xt. Usually, to = (u/ 1 ), . . . , u/ m )) is chosen to be the collection 
of component frequencies, o;W = P s (Xt = i), and hence u is constrained to the probability simplex 
(w^ > and Y^aLi 1 ^ = The observation pdf is most often parameterised as 

m 



fe{yt) = ^2^gp{i)(yt), 



i=l 

where (g\)\eA is a parametric family of probability densities and /3 = (f3^\ . . . , f3^) are the 
component-specific parameters. We assume that g\(yt) has an exponential family representation 
similar to that of (pQ) with sufficient statistic s{yt) and maximum likelihood function A : S t— >• A, 
which is such that Ylt=i s (^*)) = ar g max AgA ^ Ylt=i ^°S9x(Yt)- It is then easily checked that 
the complete-data likelihood pg belongs to an exponential family with sufficient statistics 

s M(X t ) = t{X t = i}, 

(X t ,Y t ) = t{X t = i}s(Y t ) , for i = 1, . . . , m. 

And the function 6{S) can then be decomposed as 

/?« = A (s^/S^A , for i = 1, . . . , m.. 
Hence the online EM algorithm takes the following specific form. 

Algorithm 5 (Online EM Algorithm for Finite Mixtures). Given So, Oq and a sequence of step sizes 
{ln)n>i, do, for n>l, 

Stochastic E-Step Compute 

Un-l9 B {i) {Yt) 

Pe n _ x {X t = i\Y t ) 



and 



St l) = (1 - 7n)4-? + lnPe n ^ {X t = i\Y t ) , 

= (1 - 7n )S<J? + lns{Y t )Pe n _ 1 {X t = i\Y t ) , (15) 



for i = 1, . . . , m. 
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Figure 4: Bayesian network representation of a mixture model. 



M-Step 



4 i} = 4 w,i) > 

0« = A (s^/St A 



(16) 



Example 6 (Online EM for Mixtures of Poisson). We consider a simplistic i nstance of Algorithm 



corre sponding to the mixture of Poisson distribution (see also Section 2.4 o nCappe and Moulinea . 
\200d ). In the case of the Poisson distribution, g\{yt) = ^-je ^Xf, the sufficient statistic reduces to 
s (lJt) = Ut and the MLE function A also is the identity X(S) = S. Hence, the online EM recursion 
for this case simply consists in instantiating (I15p - (|16p as 



= (1 - 7n )^J + ln P Qn _ x {X t = i\Y t ) , 
= (1 - 7n )4-i + InYtPe^iXt = i\Y t ] 

and 



(i) 



p(i) — c(A,i) /c(ui,i) 
r^n n I n 



3.4 Use for Batch Maximum-Likelihood Estimation 

An interesting issue that deserves some more comments is the use of the online EM algorithm 
for batch estimation from a fixed data record Y\, . . . , Yjy. In this case, the objective is to save on 
computational effort compared to the use of the batch EM algorithm. 

The analysis of the convergence behaviour of online EM in this context is made easy by the 
following observation: Properties (i) and (ii) stated at the beginning of Section 13.21 do not rely on 
the assumption that the model is well-specified (i.e., that vr = /^) and can thus be applied with 
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7r being the empirical distribution -K^(dy) = ■h YltLi ^Y t {dy) associated to the observed sampl^l. 
Hence, if the online EM algorithm is applied by randomly drawing (with replacement) subsequent 
"pseudo-observations" from the finite set {Y±, . . . , Yzv}, it converges to points such that 



[ N 

v e D(ir N \f e ) = -_v ^io g / e (y t ) = o, 



t=i 

that is, stationary points of the log- likelihood of the observations Y\, . . . , Y^. Property (ii) also 
provides an asymptotic equivalent of the online EM update, where the index n in (|14p should be 
understood as the number of online EM steps rather than the number of actual observation, which 
is here fixed to N. 

In practice, it doesn't appear that drawing randomly the pseudo-observations make any real 
difference when the observations Yi, ... , Y/v are themselves already independent, except for very 
short data records. Hence, it is more convenient to scan the observation systematically in tours 
of length N in which each observation is visited in a predetermined order. At the end of each 
tour, k = n/N is equal to the number of batch tours completed since the start of the online EM 
algorithm. 

To compare the numerical efficiency of this approach with that of the usual batch EM it is 
interesting to bring together two results. For online EM, based on (|14p . it is possible to show that 
yjn(6 n — 6>oo) converges in distribution to a multivariate Gaussian distribution, where #oo denotes 
the limit of #J1. In contrast, the batch EM algorithm achieves so-called linear convergence which 
means that for sufficiently large fc's, there exists p < 1 s uch that \\dk — flop 1 1 < p k . where Ok de notes 



the parameter estimated after k batch EM iterations ( Dempster et al. . 1977 . Lange, 19951 ). In 



terms of computing effort, the number k of batch EM iterations is comparable to the number 
k = n/N of batch tours in the online EM algorithm. Hence the previous theoretical results 
suggest that 

• If the number of available observations ./V is small, batch EM can be way faster than the online 
EM algorithm, especially if one wants to obtain a very accurate numerical approximation 
of the MLE. Note that from a statistical viewpoint, this may be unnecessary as the MLE 
itself is only a proxy for the actual parameter value, with an error that is of order 1 /y/~N in 
regular statistical models. 

• When N increases, the online EM algorithm becomes preferable and, indeed, arbitrary so 
if N is sufficient large. Recall in particular from Section 13.21 that when N increases, the 
online EM estimate obtained after a single batch tour is asymptotically equivalent to the 
MLE whereas the estimate obtained after a single batch EM iteration converges to the 
deterministic limit Tq(0q). 

In their pioneering work on this topic, Neal and Hinton ( 19991 ) suggested an algorithm called 



incremental EM as an alternative to the batch EM algorithm. The incremental EM algorithm 
turns out to be exactly equivalent to Algorithm U] used with 7„ = 1/n up to the end of the batch 
tour only. After this initial batch scan, the incremental EM proceeds a bit differently by replacing 
one by one the previously computed values of Eg t _ 1 [s(X t , Y t )\ Y t ] with Eg t _ 1+ ^-l)JV i s (^t, Y t )\ Y t ] 
when processing the observation at position t for the k-th time. This incremental EM algorithm is 
indeed more efficient than batch EM, although they do not necessarily have the same complexity — 
a point that will be further discussed in the next section. For large values of N however, incremental 



2 Where S u (dy) denotes the Dirac measure localised in u. 

3 Strictly speaking, this has been shown only for random batch scans. 



15 



EM becomes impractical (due to the use of a storage space that increases proportionally to N) 
and less recommendabl e than the online EM algorithm as shown by the following example (see 
also the experiments of Liang and Klein . 20091 for similar conclusions). 
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T T 
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Online EM 



T 



T 



batch tours 



Figure 5: Normalised log-likelihood of the estimates obtained with, from top to bottom, batch 
EM, incremental EM and online EM as a function of the number of batch tours (or iterations, for 
batch EM). The data is of length N = 100 and the box an whiskers plots summarise the results 
of 500 independent runs of the algorithms started from randomised starting points 9q. 

Figure [5] displays the normalised log-likelihood values corresponding to the three estimation 
algorithms (batch EM, incremental EM and online EM) used to estimate the parameters of a 
mixture of two Poisson distributions (see Example [6] for implementation details)@. All data is 
simulated from a mixture model with parameters uj^ = 0.8, = 1 and = 3. In this 
setting, where the sample size is fixed, it is more difficult to come up with a faithful illustration 
of the merits of each approach as the convergence behaviour of the algorithms depend very much 
on the data record and on the initialisation. In Figures [5] and [6j the two data sets were kept fixed 
throughout the simulations but the initialisation of both Poisson means was randomly chosen from 
the interval [0.5,5]. This randomisation avoids focussing on particular algorithm trajectories and 
gives a good idea of the general situation, although some variations can still be observed when 
varying the observation records. 

Figure [5] corresponds to the case of an observation sequence of length N = 100. In this case it 
is observed that the performance of incremental EM dominates that of the other two algorithms, 
while the online EM algorithm only becomes preferable to batch EM after the second batch tour. 
For the data record of length N = 1,000 (Figure online EM now dominates the other two 

4 Obviously, the fact that only log-likelihoods normalised by the length of the data are plotted hides some impor- 
tant aspects of the problem, in particular the lack of identifiability caused by the unknown labelling of the mixture 
components. 
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Figure 6: Same display as in Figure [5] for a data record of length N = 1,000. 



algorithms, with incremental still being preferable to batch EM. Notice that it is also the case 



after the first batch tour, illustrating our claim that the choice of 7 n 



n 



-0.6 



used here for the 



online EM algorithm is indeed preferable to the value of 7 n = n _1 , which coincides with the update 
used by the incremental EM algorithm during the first batch tour. Finally, one can observe on 
Figure E] that even after five EM iterations there are a few starting values for which batch EM or 
incremental EM gets stuck in regions of low normalised log-likelihood (around -1.6). These indeed 
correspond to local maxima of the log- likelihood and some trajectories of batch EM converge to 
these regions, depending on the value of initialisation. The online EM with the above choice of 
step-size appears to be less affected by this issue, although we have seen that in general only 
convergence to a stationary point of the log-likelihood is guaranteed. 



4 Discussion 



We conclude this chapter by a discussion of the online EM algorithm. 

First, the approach presented here is not the only option for online para meter estimation 
in latent variable models. One of the earliest and most widely used (see, e.g., Liu et al. . 2006) 



algorithm is that of iTitteringtonl ()1984l ) consisting of the following gradient update: 



e n = + 7n /- 1 (0 ri _ 1 )v e iog/0 ri _ 1 (y ri ) , 



(17) 



where the matrix I p refers to the complete-data Fisher information matrix. For complete-data 
models in natural parameterisation — with ip = 1 in ([1]), I p (0) coincides with VgA(6) and, as it 
does not depend on X.% or Y%, is also equal to the matrix J(0) that appears in (|14p . Thus, Titter- 
ington's algorithm is in this case asymptotically equivalent to online EM. In other cases, where I p 
and J differs, the recursion of ()17p converges at the same rate as the online EM algorithm but is 
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not Fisher-efficient. Another difference is the way Algorithm 0] deals with parameter constraints. 
Assumption [T] implies that the M-step update, taking into account the possible parameter con- 
straints, is explicit. Hence, 9 n = 9{S n ) does satisfy the parameter constraint by definition of the 
function 9. In the case of Example [6] for instance, the proposed update does guarantee that the 
mixture weight vector uj stays in the probability simplex. This is not the case with the update 
of (117ft which requires reparameterisation or reprojection to handle possible para meter constraints. 

As discussed in Section [3^1 the online EM algorithm is inspired by the work of lNeal and Hinton 
but distinct from their increme ntal EM app roach . To the best o f our k nowledge, the online 
EM algorithm was first proposed by ISatol (|2OO0h and ISato and Ishiil (|2000h who described the 
algorithm and provided some analysis of convergence in the case of exponential families in natural 
parameterisation and for the case of mixtures of Gaussians. 

In Section 13.41 we have seen that the online EM algorithm is preferable to batch EM, in terms 
of computational effort, when the observation size is sufficiently large. To operate in batch mode, 
the online EM needs to be used repeatedly by scanning the data record several times so as to 
converge toward the maximum likelihood estimate. It is important to note however that one 
iteration of batch EM operating on N observations requires N individual E-step computations 
but only one application of the M-step update; whereas online EM applies the M-step update 
at each observation and, hence, requires N M-step updates per complete batch scan. Thus, the 
comparison between both approaches also depends on the respective numerical complexities of the 
E and M steps. The online approach is more attractive for models in which the M-step update 
is relatively simple. The availability of parallel computing resources would be most favourable 
to batch EM for which the E-step computations pertaining to each observation may be done 
independently. In contrast, in the online approach the computations are necessarily sequential as 
the parameter is updated when processing each observation. 

As indicated in the beginning of this chapter, the main strength of the online EM algorithm 
is its simplicity and ease of implementation. As discussed above, this is particularly true in the 
presence of constraints on the parameter values. We have also seen in Section f3. 2 1 that Algorithm [J] 
is very robust with respect to the choice of the stepsize j n . In particular, the absolute scale of 
the stepsize is fixed due to the use of a convex combination in (|12p and it is generally sufficient 
to consider stepsizes of the form 7 n = n~ a . We have shown that values of a of the order of 0.6 
(i.e., closer to the lower limit of 0.5 than to the upper one of 1) yield more robust convergence. In 
addition, Polyak-Ruppert averaging can be used to smooth the parameter estimates and reach the 
optimal asymptotic rate of convergence that makes the online estimate equivalent to the actual 
MLE. 

As illustrated by Figure [21 the online EM algorithm is not optimal for short data records of, 
say, less than 100 to 1,000 observations and in this case performing batch mode estimation by 
repeatedly scanning the data is recommended (see Figure [5] for the typical improvement to expect 
from this procedure). The main limitation of the online EM algorithm is to require that the 
M-step update 9 be available in closed-form. In particular, it would be interesting to extend the 
approach to cases where 9 needs to be determined numerically, thus making it possible to handle 
mixt ures of Generalised L i near Models for instance (and not only mixtures of linear regressions 



as m 



Cappe and Moulined . l2009h . 



We finally mention two more directions in which recent works have proposed to extend the 
online EM framework. The first one concerns the case of non-independent observations and, in 
particular, of observations that fol lows a Hidden Markov Mo del (HMM) with Markov depen- 
dencies between successive states. Mongillo and Deneve ( 20081 ) and iGappel (|2009l ) have proposed 
an algorithm for HMMs that appears to be very reminiscent of Algorithm [H although not di- 
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rectly interpretable as a stochastic approximation recursion on the expected sufficient statistics. 
The other topic of importance is motivated by the many cases where the E-step computation 
of Eg ri _ 1 [s(X n , Y n )\ Y n ] is not feasible. This typically occurs when the hidden variable X n is a 
continuous variable. For such cases, a promising soluti on cons i sts in approximating the E-s tep 
using some form of Monte Carlo simulations (see, e.g., Cappe . 2009 . Del Moral et al. . 20091 for 
methods that use sequential Monte Carlo approaches). However much conceptual and theoreti- 
cal work remains to be done, as the consistency result summarised in Section 13.21 only extends 
straightforwardly to the 
the conditionals po n _ 1 (x 1 



(0„-i) 



from 



-rather limited — case of independent Monte Carlo draws X n 
Y n ). In that case, s(X n dn ~ 1 \ Y n ) still provides an unbiased estimate of 



the limiting EM mapping in 6 n -\ and the theory is very similar. In the more general case where 



Markov chain or sequential Monte Carlo simulations are used to produce the draws X n 
convergence of the online estimation procedure needs to be investigated with care. 



(0n-l) 



the 
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